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CHAPTER 1 


INTRODUCTION 
1.1 The Microburst 

Since the late 1970’s, a form of low altitude windshear, called a microburst, 
has become recognized as a significant hazard to aircraft during takeoff and landing. 
The term microburst was first used by Fujita to describe a rapid downflow of air 
that upon impacting the ground forms a horizontal outflow extending less than 4 
km in diameter (Figure 1.1) [1]. Observations of roughly 75 microbursts during 
the Joint Airport Weather Studies (JAWS) Project conducted at Denver’s Stapleton 
International Airport in 1982 revealed that they have a relatively short life span, 
usually less than 15 minutes, with severe shear lasting only 2 to 4 minutes. Within 
that small window of time and distance, however, headwind to tailwind differentials 
can approach 100 MPH. Under such conditions, a pilot may have less than 30 seconds 
to make a successful recovery [2]. 

Typically, when an aircraft encounters a microburst at low altitude, it first experi- 
ences a performance increasing headwind, followed by a severe downdraft, and finally 
a performance decreasing tailwind. The rapid loss of altitude caused by the down- 
draft and the ensuing loss of altitude and airspeed due to the tailwind can bring an 
aircraft too close to the ground with too little airspeed to escape a crash. It has been 
determined from measurements taken in the JAWS Project that some microbursts 
are so severe that once entered, there is no chance for recovery [2]. Compounding 
the problem is the deceptively benign visual appearance of many microbursts. The 
warning signs of a microburst can be as subtle as a column of rain associated with 
the downdraft or, if the rain evaporates above the ground, packets of windblown dust 
near the ground. It should be noted that windshear detection devices are presently 
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installed on some commercial aircraft, but these are so called reactive systems. Re- 
active systems are based on in situ devices and can only detect a hazard once it has 
been entered. 

The need to provide advanced warning of windshear conditions has spurred a joint 
effort between NASA and the FAA to develop a new generation of airborne Doppler 
weather radar operating in the X-band (8-12GHz). Doppler weather radar is actually 
one of several candidate sensors, including lidar and infrared, for a forward-looking 
airborne microburst detector. Each of these sensors requires similar signal processing, 
but they rely on different scattering mechanisms [3]. Since the predominant type of 
scattering target may vary from small particles of dust in one microburst to droplets 
of rain in another microburst, a combined system incorporating all three of the above 
sensors may be necessary to provide reliable detection under all atmospheric condi- 
tions. The upshot is that sophisticated signal processing will be necessary to identify 
microbursts in real time. 

Part of the signal processing effort involves isolating a “signature” for microbursts 
that can be used to indicate potentially hazardous situations. The horizontal compo- 
nent of the windspeed versus range for the headwind/downdraft/tailwind sequence 
gives what is called the “S”-curve. Beginning at ranges nearest to the aircraft, the 
windspeed has a large negative value in the headwind, diminishing to zero at the 
heart of the downdraft, and becoming a large positive value in the tailwind. The 
“S”-curve is a signature detectable by a pulse Doppler radar capable of estimating 
average windspeed at various ranges. This sensor coupled with some form of pattern 
recognition scheme may be able identify the “S”— curve due to a microburst in the 
flight path. Figure 1.2 shows an “S”-curve generated by plotting mean windspeed es- 
timates versus range for a simulated microburst generated by the Airborne Windshear 
Doppler Radar Simulation (AWDRS) Program [4]. 

Another form of hazard indication that has gained wide acceptance is the “ F 
factor proposed by Bowles and Targ [5]. It is an index independent of the aircraft 
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Figure 1.2 u S”-curve for a simulated microburst. § 
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weight and thrust capability and is based upon measurable atmospheric conditions. 
The “F” -factor is defined as 



where: 

W x = spatial derivative of the horizontal component of wind velocity, 

Wh = vertical component of wind velocity, 
g = acceleration due to gravity, 

V = the airspeed of the aircraft. 

To get a feeling for what the “F”-factor means in physical terms, consider that 
a positive spatial derivative of the horizontal wind velocity indicates an increasing 
tailwind which has a performance decreasing effect, while a positive vertical wind 
velocity indicates a performance increasing updraft. Threshold values for F have 
been placed between 0.1 and 0.15 [5]. Figure 1.3 shows how “F”-factor can be 
interpreted as an indicator of safe operation. 

1.2 Pulse Doppler Radar 

Pulse Doppler radar is a sensor of particular interest in the windshear detection 
problem. This is not surprising since these instruments have been used for years in 
meteorological research of storm dynamics, hydrology , and cloud and precipitation 
physics [3, 6, 7, 8, 9]. 

1.2.1 Principles of Operation 

The block diagram in Figure 1.4 shows the basic elements of a pulse Doppler 
radar. Beginning with the transmitter, a high power amplifier is used to produce 
a spectrally clean signal at a radio frequency f t . A pulse modulator switches the 
amplifier on and off generating a train of pulses of duration r and separated by the 
interpulse period (IPP), the reciprocal of the pulse repetition frequency (PRF). Figure 
1.5 shows a conceptual drawing of the transmitted signal. 
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Figure 1.4 Block Diagram of a Doppler radar. 
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At the antenna, the pulse train is radiated as electromagnetic energy which is 
backscattered by targets in the path of the beam. In most cases, a single antenna is 
shared by the transmitter and receiver to reduce the space, weight, and cost require- 
ments. In a shared antenna radar, referred to as monostatic, a duplexer is used to 
switch the antenna from the transmitter to the receiver during the period of time that 
the amplifier is turned off, (IPP — r). After a small delay to assure decoupling from 
the transmitter, the receiver “listens” for backscattered returns from the previous 
pulse. 

The range of a target can be determined by measuring the time it takes for a 
pulse to travel to the target and have its reflected energy return to the receiver. The 
receive time between pulses can then be divided into time segments called range cells 
which indicate the range of returns that fall into a particular time interval (Figure 
1.6). The maximum unambiguous range is given by 


Rmax — 


2 PRF' 


where c is the speed of light. 

By taking advantage of the Doppler shift principle, a pulse Doppler radar can 
also measure the radial velocity of a target. When an electromagnetic wave with 
frequency f t is reflected by an object moving away from the radar platform at a 
relative velocity v, the frequency seen at the receiver is shifted by 


u = - 


Figure 1.7 illustrates the Doppler effect. The maximum unambiguous Doppler veloc- 


ity is given by 


Vmax = PRF 


4 ur 


With the aid of signal processing then, a pulse Doppler radar can map the radial 
velocities of targets out to its maximum unambiguous range. 
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Figure 1.6 Gated return for a pulse Doppler radar. 
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The targets that a weather radar is depending on as reflectors are windblown 
particles such as rain, insects, dust, or changes in the refractive index. Scattering 
targets that are swept along with the air mass and are illuminated by the radar pro- 
vide information about the winds within the resolution volume. By the central limit 
theorem, the return signal of a large number of randomly distributed targets can be 
modelled by a narrowband Gaussian process. The effects of windshears, turbulence, 
and antenna motion combine to broaden the Doppler spectral width, but do not in- 
validate the Gaussian approximation [3, 10, 11]. Ideally, the Doppler power spectrum 
might look like the one represented in Figure 1.8 where the mean represents the av- 
erage radial velocity of particles in the illuminated volume and the variance is the 
spread of the Doppler velocities. Estimates of these parameters are performed in the 
signal processing portion of the radar. 

1.2.2 Microburst Detection 

The detection of microbursts with a pulse Doppler radar is complicated by several 
factors. One of these factors is the low reflectivity levels exhibited by certain types 
of microbursts. Preliminary research has been directed towards microbursts with 
significant levels of precipitation known as wet microbursts [5, 12]. The classification 
of wet and dry microbursts is rather qualitative, but a rainfall rate greater than 25 
dBZ generally denotes a wet microburst and a rainfall rate less than 20 dBZ generally 
denotes a dry microburst. The strong return levels from wet microbursts make them 
the simplest case for evaluating various detection algorithms. Dry microbursts will 
call for more sophisiticated signal processing since the primary sources of reflections, 
particles of dust and insects, exhibit a much lower reflectivity of radar energy. 

Along with the possibility of a low energy weather returns, the problem of mi- 
croburst detection by an airborne radar is complicated by the presence of strong 
ground return or clutter. When the aircraft is in low altitude flight, such as during 
takeoff or landing, a portion of the antenna beam is likely to illuminate objects on 
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Figure 1.8 Doppler power spectrum of a Gaussian process. 
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the ground such as buildings, trees, or cars on a freeway. The return from these 
powerful reflectors can utterly obliterate any returns due to weather [10]. Most of the 
clutter energy appears around zero Doppler referenced to the aircraft ground speed 
and is the result of strong returns from stationary objects through the main beam 
of the antenna. Additional discrete clutter due to returns from moving objects on 
the ground or returns from large objects through the antenna sidelobes may appear 
at frequencies shifted away from zero Doppler. Figure 1.9 shows the Doppler power 
spectrum of range cell data generated by the AWDRS program [4]. Notice the narrow 
clutter mode located near 0 m/s and the broader weather mode located around 10 
m/s. The clutter problem may be partially abated by tilting the antenna higher with 
respect to the glide slope so that less of the beam is directed towards the ground, but 
then the beam is directed more in the downdraft portion of the microburst, running 
the risk of missing the highly characteristic horizontal outflow. The upshot is that 
some form of clutter rejection digital filtering may be necessary to suitably enhance 
the signal to clutter ratio [13, 14, 15]. 

1.2.3 Signal Processing 

Estimation of weather parameters is typically performed on a per range cell basis. 
To improve the accuracy of the estimate, sets of complex sampled data points are 
collected from the radar IF output over a number of pulses for each range cell. The 
number of points in each range cell record is determined by the length of time over 
which the statistical properties of the targets can be assumed to be stationary. 

The conventional approach to processing radar signals has been to use a series 
of special purpose arithmetic units followed by a programmable microcontroller for 
processing each range cell [16]. The typical stages of the range cell computation for 
a pulse Doppler weather radar might include some form of clutter rejection, estima- 
tion of the Doppler power spectrum and/or estimation of spectral parameters, and a 
suitable detection algorithm. The most popular technique for computing the Doppler 
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Figure 1.9 Doppler power spectrum of a typical range cell. 



power spectrum is the fast Fourier transform (FFT) which has been implemented on 
high speed, programmable digital signal processing (DSP) chips. The mean wind- 
speed and spectral width can be computed from the Doppler power spectrum, or 
obtained direct^ using the pulse-pair algorithm [17, IS, 19]. 

Since each of the range cells can be processed independently using the same set 
of algorithms, this processor can operate on range cells in either pipeline fashion or 
in parallel depending on the throughput requirements. To make the computation 
fully parallel, a duplicate processing bank for each range cell is provided. Special 
purpose hardware has been necessary until recently to achieve real time data rates. 
A step towards greater flexibility has been made with the programmable DSP board 
which takes advantage of multiple high speed DSP chips. Many of these boards are 
programmable in high level Languages such as C. DSP boards are a powerful tool for 
radar signal processing, but they are, to an extent, still special purpose devices, and 
as such they can require significant development time. With the recent advances in 
computer technology and parallel processing, it has become feasible to perform real 
time radar signal processing on arrays of general purpose microprocessors. 

1.3 Parallel Processing 

Radar signal processing lends itself well to concurrent computation when the 
problem is decomposed by range cells. This is a natural way to break down the com- 
putation because interprocessor communication is unnecessary for range cell depen- 
dent products. Message passing is only necessary for distributing the data throughout 
the network and recollecting the results. In addition, a balanced computational load 
across all the processors is assured when the range cells are equally distributed. 

Within the sphere of concurrent processing, there are a number of issues to be 
addressed. First consider the size of each individual processing element, known as the 
grain size. The grain size will be dictated by the complexity of the necessary algo- 
rithms and the memory requirements for each node. For the problem decomposition 
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described above, each element will be responsible for processing one or more range 
cells, each composed of a set of complex data points . The size of the range cell data 
records and the need for sophisticated signal processing algorithms calls for a large 
grain size parallel computer. 

Next there is the choice of Single Instruction Multiple Data (SIMD) or Multiple 
Instruction Multiple Data (MIMD) architectures. On a SIMD machine, all the nodes 
execute the same sequence of instuctions on their respective sets of data. An example 
of SIMD architecture is the systolic array where a number of special purpose small 
grain elements are interconnected to perform a specific function such as matrix mul- 
tiplication. This approach is reasonable for the radar application because each range 
cell is to receive exactly the same processing, but typically these machines are small 
grain size and/or massively parallel, consisting of I6K or more processors. A SIMD 
computer with a smaller number of large grain size nodes is a feasible architecture, 
but they are not generally available because for a large grain size a MIMD computer 
can offer essentially the same processing capability with greater flexibility [20]. The 
majority of large grain parallel computers are of the MIMD class where each node 
stores and executes its own instruction set independently. 

MIMD computers can further be classified as shared memory or distributed mem- 
ory. Shared memory computers have a common bus and memory for all the nodes. 
They use their memory more efficiently and are easier to program, but all the mi- 
croprocessors have to compete for the same resources. This leads to bus contention, 
severely limiting the number of processors possible for a potential system. With dis- 
tributed memory, each node has its own local memory and bus, thereby eliminating 
the problem of bus contention. Communication takes place through message passing, 
which can only exist between nodes that are directly connected through links. Con- 
ceptually, a shared memory machine can have “full interconnect” capability among 
its processors by communicating through common memory locations [20]. However, 


16 


llli I MV Ml III Mill li Mlllllll *1 II HIM rihim III! ■ i.n 



there should be no significant drawback to the distributed memory machine as op- 
posed to the shared memory machine in this respect for the radar signal processing 
application, since very little interprocessor communication is required when range cell 

decomposition is used to parallelize problem. 

In addition to sequential and parallel processing, a third possibility is vector 
processing which can be considered a hybrid of the sequential and parallel schemes. 
These machines are optimized to operate on whole vectors of data at one time. By 
employing vector registers, fully pipelined vector functional units, and fully pipelined 
vector loads and stores, they are able to reduce the number of instruction and memory 
fetches, eliminate memory latency with register to register operations, as well as 
achieve arithmetic speedup. In the past, vector processors were mostly limited to 
use in super computers such as the Cray Y-MP, but more recently they have become 
available for more general purposes [21]. 

1.4 Problem Statement 

In order to avoid a potential windshear hazard, it has been estimated that a 
pilot is going to need an advanced warning time of 15 to 40 seconds [12]. A pulse 
Doppler radar offers the potential for a look ahead capability over this time period, 
but to get the relevant information to the pilot in a timely manner, the required signal 
processing much be accomplished in real time. In the past, radar signal processing 
has been performed on special purpose computing devices that offer extremely high 
performance at the expense of flexibility. The emergence of concurrent processing 
techniques and hardware has opened the possibility of achieving real time data rates 
on an array of parallel computers that are easily reprogrammable and reconfigurable. 
The advantage gained will be to shift the primary development effort from hardware 
to software, greatly enhancing the adaptability of the system to future changes. 

Investigation of parallel processing techniques has largely been carried out on 
a PC based parallel computer called the transputer. Chapter 2 will introduce the 
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processor requirements that have been specified for a microburst detector and consider 
how thej' translate into requirements for a multicomputer. Chapter 3 will discuss 
ways of analyzing the performance of multicomputers, and Chapter 4 will describe 
the transputer system and the Occam model of concurrent processing. Chapter 5 will 
describe a transputer based implementation of a concurrent radar signal processor 
including algorithms, hardware, and benchmarks. Finally, Chapter 6 will bring all 
these issues into focus, presenting conclusions drawn from this research and making 
recommendations for future work. 
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CHAPTER 2 

COMPUTATIONAL REQUIREMENTS 


The computational requirements for a real time microburst detector are deter- 
mined by certain hardware parameters of the windshear radar, the computational load 
of the desired algorithms, and the amount of processor margin factored into the calcu- 
lation. Estimates of these requirements put forth by E. G. Baxa and M. A. Richards 
are collected and summarized in a report by the Radar Signal/Data Processor (RSDP) 
Task Force [22]. Processor requirements for several sets of algorithms of varying com- 
plexity, referred to as algorithm suites, are given in terms of the following well known 
performance measures for uniprocessor computers: 

1. million instructions per second (MIPS), 

2. million floating point operations per second (MFLOPS) [23], 

3. thousand bytes of memory storage (KByte). 

These values are intended to provide a guideline for the kinds of hardware choices that 
are available. It should be noted that neither MIPS nor MFLOPS are in themselves 
sufficient measures for comparative computer performance. Differences in machine 
instruction sets and user programs make it impossible to construct a single perfor- 
mance metric that is meaningful for all types of computers and application programs. 
The only meaningful measure of computer performance is the execution time for the 
specific application program of interest to the user [21]. Since it is not reasonable 
to measure the execution time of the application on all possible types of computers, 
and the nature of the application may not be completely defined beforehand, these 
performance numbers serve to reduce the field of candidate systems. To illustrate 
how values for these parameters are obtained, an analysis following that given in the 
RSDP report is recreated below and extended to the case of concurrent processing. 



2.1 Radar Parameters 


An X-band airborne Doppler weather radar currently operated by the Antenna 
and Microwave Research Branch (AMRB) of NASA Langley Research Center to flight 
test the windshear dectection system is the source of hardware parameters used to 
calculate the processing requirements [24]. Many of the radar’s characteristics are 
adjustable allowing it to operate at several PRFs, pulse widths, scan rates, et cetera. 
The flexibility of the AMRB test radar makes it a reasonable choice for a benchmark 
system since a production windshear radar is likely to operate within its worst case 
configuration. Appendix A lists the relevant radar parameters and their possible 
settings, but the most challenging configuration for a real time processor would be as 
follows: 

1. PRF = 9581 pulses per second, 

2. number of range cells (Nr) = 69, 

3. scan rate (0) = 37.5° per second, 

4. scan width (6s) = 60°, 

5. azimuth lines per scan width sector (N$) = 40. 

Note that these values do not represent the worst case for each parameter indepen- 
dently, rather they constitute a collective worst case configuration. 

With the radar parameters specified, the maximum allowed processing time can 
now be calculated. Consider the case where an averaged and sampled value from 
each range bin is collected over a certain number of IPPs to form a range cell record 
and all of these range cell records are block processed in real time. Then the time 
available in which to perform the necessary processing, f p , is determined by the P RF 
and the number of complex data points, N[, in each range cell data record. 
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p PRF' 


The value of t p reflects the amount of time to process the returns from one pulse 
of the radar, so on a uniprocessor that time must be divided among all the range cells 
leaving Ir as the time to process each range cell where f# is given by 


tR Nr 

If all the range cells can be processed concurrently, then nearly the full t v could be 
dedicated to processing each range cell record. Given the worst case radar parameters 
and a Nj of 128 pulses, t p would be 13.36 msec and tR would be 0.19 msec. 

The time available for processing can be increased somewhat by taking advantage 
of the entire scan time between azimuth lines. Typically the returns from Nj pulses 
are processed per line of azimuth and any additional time required for the antenna 
to scan to the next azimuth line can be considered free for processing. Thus the full 
t p is determined by 

t =-^~ 

P ONe 

The degree to which t p is increased depends on the PRF . For large values of the P RF 
the improvement can be significant. Applying the worst case radar parameters to the 
revised expression for t p gives 40 msec (or 0.58 msec per range cell on a uniprocessor). 
That amounts to nearly a three fold increase over the time calculated by the previous 
method. 


2.2 Algorithms 

Once the available processing time has been identified, then the algorithmic 
complexity is assessed to determine the number and type of operations that must 
occur within that time and to determine the amount of memory storage necessary 
to hold the program and data. To facilitate analysis, four suites of algorithms are 
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presented, ranging from a minimal baseline system to a growth system. The following 
is a list of algorithms presently being considered for the microburst detector: 

1. time domain clutter rejection filtering, 

2. spectral domain clutter rejection filtering, 

3. Fourier spectral estimate, 

4. inverse Fourier transform, 

5. autoregressive spectral estimate, 

6. time domain mean and width estimate (pulse-pair), 

7. Fourier domain mean and width estimate (pulse-pair), 

8. hazard factor computation. 

The four algorithm suites are shown in flow diagram form in Figures 2.1 - 2.4. 
Suite I consists of a 2 nd order HR Moving Target Indicator (MTI) clutter rejection 
filter followed by a pulse-pair processor and hazard factor computation. This suite is 
considered to offer the baseline level of acceptable performance. In suites II and III 
the 2 nd order HR filter is replaced by a 39^ order FIR filter followed by a spectral 
estimate, some additional clutter filtering in the spectral domain, and finally spectral 
domain mean and width estimates. In suite II the spectral estimation is done by a 
Fast Fourier Transform (FFT) while in suite III it is done by autoregressive modeling. 
The growth system, suite IV, includes all of the algorithms listed in the first three 
suites. Obviously some of these algorithms are redundant, but this suite is designed 
to anticipate future changes in the system. 

For each of the algorithm suites, an estimate of the number of instructions, 
number of floating point operations (FLOPs), and required memory locations is made. 
Dividing the number of instructions and the number of floating point operations by 
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Figure 2.1 Algorithm Suite I Flow Diagram. 
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Figure 2.2 Algorithm Suite II Flow Diagram. 
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Figure 2.3 Algorithm Suite III Flow Diagram. 
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Figure 2.4 Algorithm Suite IV Flow Diagram. 
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Figure 2.5 Performance Parameters. 


the available processing time gives the desired performance parameters MIPS and 
FLOPS respectively. 

Finally, a processor margin of 100% is factored into the analysis. Processor 
margin is an attempt to account for inaccuracies of MFLOPS and MIPS as measures 
of computer performance and allow for future growth of the system. For a parallel 
processor, the processor margin can also be used to account for overheads associated 
with concurrency, for example communication overhead. Figure 2.5 summarizes the 
resulting computational requirements for a uniprocessor and for a multicomputer in 
terms of performance per range cell. 
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CHAPTER 3 


CONCURRENT PERFORMANCE 


The previous chapter introduced some performance metrics that are commonly 
used to evaluate single processor computers; MFLOPS, MIPS, and KBytes. These 
same parameters, along with the interprocessor communication bandwidth (measured 
in Mbytes/sec), are used to characterize the hardware of a multicomputer. Additional 
measures are useful to quantify application specific issues such as how the problem 
scales with the number of processors applied to it, what communication and syn- 
chronization overheads are associated with it, and what fraction of the problem is 
inherently sequential. While total execution time is still the ultimate criterion for the 
performance of multiple processor computers, measures such as speedup, efficiency, 
and communication overhead add valuable insight into the behavior of concurrent 
machines and may point to ways of improving their performance. 


3.1 Communication Overhead 

Time spent on interprocessor communication on a parallel computer, in the sim- 
plest view, can be interpreted as a drawback to the multiprocessor approach in com- 
parison with the uniprocessor alternative. It has been argued that this is an unfair 
evaluation since interprocessor communication times should in fact be compared to 
the time to access tiered or virtual memory on a sequential computer [20]. The fol- 
lowing discussion will ignore this point and concentrate on ways of estimating the 
overhead due to communication. 

Following the notation suggested by Fox [20], the fractional communication over- 
head of an application is defined as 

Tromm 


fc = 


T C alc 
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where T comm is the total time spent on communication and T ca i c is the total time spent 
on calculations. T comm and T ca i c can be written in terms of hardware parameters 

t comm = the average time to communicate a word between two nodes. 
t ca ic = the average time the perform a calculation. 

Both of these parameters are loosely defined because the best strategy for measuring 
their values often depends on the specific application. For example, the average 
length of a message can have a significant effect on t C omm due to the startup latencies 
associated with each communication. In scientific applications f CQ / c is often assumed 
to be the time for a floating point operation, but the various types of floating point 
operations can involve a wide range of execution times. The relative occurrence of 
each type of floating point operation in the code may need to be taken into account to 
arrive at an accurate estimate of t ca i c . These issues, among others, would have to be 
considered in designing a scheme for accurately measuring t comm and t ca i c , but often a 
rough estimate is sufficient for analysis purposes. A simple method for estimating the 
ratio of t comm to t ca i c is to use the ratio of a multicomputer’s floating point performance 
to its communication bandwidth as follows: 

tcomm ~ MFLOPS_ 

t calc ~ MMzi 

sec 

The importance of fc is that for good performance, it should have a relatively small 
value, usually on the order of unity, and it should not increase dramatically as more 
processors are applied to the problem. If fc is strongly dependent on the number of 
processors, that may indicate that the problem is not very scalable. 

As an illustration of how fc might be calculated, consider a hypothetical multi- 
computer intended to provide signal processing for a pulse Doppler radar windshear 
detector. For simplicity, it will comply with the following assumptions. 

1. The computational requirements from Chapter 2 are satisfied. 
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2. The number of nodes is equal to the number of range cells, jV/?. 

3. A single communication is required to distribute each range cell. 

4. The radar is in the worst case configuration. 

5. Nj equals 128 samples (arbitrary). 

6. The algorithm suite corresponds to the growth system. 

Given the above conditions, T comm and T ca i c could be estimated by: 

T C omm 2A r /A(Rf (:omm , 

Tcaic ~ N FLOPS* calc j 


where Nflops is the total number of floating poing operations determined in the com- 
putational requirements. By plugging in the values for the worst case configuration, 
the communication overhead is approximated by 

tcomm 


fc ~ c- 


tcaic 


where 


2N[Nr 

c = - = 0.0026. 


Nflops 

The small value for c is due to the computational overkill of the growth suite of algo- 
rithms and, to a certain extent, an underestimation of the amount of communication 
resulting from assumption 3. Such a small value of c may indicate that distribut- 
ing each range cell computation onto more than one processesor may yield better 
performance for that particular algorithm suite. 
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3.2 Speedup Measures and Efficiency 

Perhaps the most important measure of performance for parallel computers is 
speedup, S(N), which is the ratio of the execution time on a single processor to the 
execution time on a parallel computer with N processors. 


S(N) = |L 



The efficiency can be reduced from its ideal value of unity by a number of factors. 
The concurrent algorithm may not be as good as its sequential counterpart, or even 
where the same algorithm is used, the concurrent implementation may require addi- 
tional instructions to control the distribution of the problem over multiple processors. 
Load balancing can also be an important issue since the problem can only run as fast 
as the slowest node, and of course the communication overhead that was discussed in 
the previous section has an effect on the efficiency. Of all these factors, the commu- 
nication overhead is the most significant for the radar signal processing application. 
The algorithms being used require little modification to accomodate concurrency, and 
load balancing is fissured as long as an equal number of range cells are processed on 
each node. 

Bounds on both the speedup and efficiency have been established in previous 
work by Amdahl [25] and Eager et al [26]. Since the efficiency follows directly from 
the speedup, only bounds on speedup will be considered here. 

Every algorithm can be divided into portions that are independent and can thus 
be performed in parallel and portions that are inherently sequential. The speedup of 
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an algorithm is limited by the extent to which its code is sequential. More specifically, 
if / is the fraction of the code that is sequential, then Amdahl’s law [25] states that 
speedup is bounded by: 


S(N) < 


1 



For example, if 20% of an algorithm is inherently sequential, then no matter how 


many processors are applied to the problem, the maximum achievable speedup is 5. 
It has been argued that Amdahl’s law leads to an overly pessimistic evaluation of 
large scale parallel computing. One argument is that many algorithms can be written 
to reduce the number of sequential operations to an acceptably low level if they are 
designed for concurrency from the start rather than modified from existing sequential 
code [20]. It has also been suggested that instead of running a fixed sized problem 
on different numbers of processors, a more realistic approach is to fix the execution 
time and solve the largest size problem possible [27]. 

A technique for bounding speedup similar in spirit to that of Amdahl is the use 
of the average parallelism measure proposed by Eager, Zahorjan and Lazowska [26]. 
The average parallelism of an algorithm, A, can be defined in a number of ways, but 
here it will be defined as the average number of processors that would be active during 
the execution of an algorithm if an unlimited number of processors were available. 
The upper and lower bounds on the speedup in terms of the average parallelism and 
the number of processors are given by 


NA 


< S(N) < min(A r , A). 


N + A- 1 

For range cell processing, the average parallelism can be estimated by the number of 
range cells, Nr. Assuming there is a node for each range cell the bounds on speedup 
become 


Nr 2 


< Sn < Nr. 


2Nr — 1 

The theoretical limits on speedup will be compared to actual measured values in 
Chapter 5. 
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CHAPTER 4 
THE TRANSPUTER 


A system for processing pulse Doppler radar signals has been implemented on a 
transputer-based multicomputer. Transputers are a family of computers that com- 
bine a microprocessor, memory, and four serial communication links all on a single 
VLSI chip. Figure 4.1 shows a block diagram of the T-800 transputer architecture. 
Transputers can be housed in an ordinary PC on full length add-m cards and the 
transputer nodes on those cards can be arranged in a variety of network configura- 
tions controllable in software. While compilers for the transputer include FORTRAN, 
C, and PASCAL, they are most easily programmed in OCCAM, a language designed 
for concurrent processing that is tightly coupled to the architecture of the transputer. 
In the sections below the transputer architecture will be discussed, OCCAM will be 
introduced both as a language and as a model for concurrent processing, and finally 
transputer and OCCAM support for real time processing will be covered. 

4.1 Transputer Architecture 

A distinguishing characteristic of the transputer is the cohesiveness and simplic- 
ity of its design. This can be seen in the hardware support provided for the OCCAM 
model of concurrent processing, in the freedom that the software developer is given 
from considering strictly hardware issues, in the ease with which members of the trans- 
puter processor line can be interchanged with only minor software changes, and in the 
inherent self-sufficiency of the transputer chip. With very little external circuitry a 
transputer chip can be made into a fully functional multicomputer node. Networks of 
transputer nodes communicate by passing messages over their serial links. Since each 
node is equipped with four links, transputer networks can assume topologies with 
degree less than or equal to four. Some examples of possible transputer networks are 
shown in Figure 4.2 [28]. 
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Figure 4.2 Examples of transputer networks. 
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4.1.1 Microprocessor and Memory 

The transputer microprocessors come in 16 and 32 bit varieties and are essen- 
tially reduced instruction set computers (RISC) except for certain instructions that 
support scheduling and message passing [29]. An individual transputer is a sequential 
processor operating on a single instruction stream. A powerful feature of the trans- 
puter is its hardware process scheduler that performs high speed multitasking. Task 
switching times are typically about 1 microsecond for OCCAM processes [30]. The 
ability to multitask allows individual transputers to simulate concurrency. A pro- 
gram made up of multiple processes can be run on a single transputer or a network 
of transputers with only slight software modifications. This means that applications 
can be developed and tested on a single transputer and then mapped on to a network 
without changing the program logic. 

There are two levels of memory hierarchy on a transputer node. The fastest is 
the on-chip static RAM (SRAM), normally used by compilers as a register stack for 
storing often used addresses and variables but also addressable by programmers who 
wish to optimize performance. The access time for on-chip memory is one cycle which 
on a 30 MHz T800 equals 33 nanoseconds. External dynamic RAM (DRAM) requires 
more C 3 'cles to access, but up to 4 Gbytes can be addressed. An on-chip external 
memory interface drives the external DRAM without any additional circuitry. The 
difference between using internal or external memory is generally transparent to the 
user except for the speed. 

Figure 4.3 summarizes the characteristics of the more well known members of 
the transputer family. The T800 has an on-chip 64-bit IEEE floating-point processor 
(FPU) [23]. The T800 FPU provides much greater floating point performance than 
that of T212 or T414 which use a run-time subroutine package to do floating point 
operations. The latest generation of transputer is the T9000 which is not yet com- 
mercially avialable, but boasts an order of magnitude increase in performance over 
the T800. 
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T212 

T414 

T800 

T9000 

Clock speed (MHz) 

20 

20 

17.5 

20 

30 

50 

Bus width (bits) 

16 

32 

32 

32 

32 

32 

On chip SRAM (Kbytes) 

2 

2 

4 

4 

4 

16 

MIPS 

10 

10 

8.75 

10 

15 

200 

MFLOPS 

* 

0.1 

1.31 

1.5 

2.25 

25 


Figure 4.3 Transputer family statistics. 


Based on the computational requirements from Chapter 2 and the transputer 
performance specifications and assuming linear speedup, a network of roughly 6 T800 
transputers would be required to implement the baseline suite, 30 T800s for suite II, 
54 T800s for suite III, and 76 T800s for the growth suite. If the T9000 was used, the 
required number of processors would be reduced to roughly 1 for the baseline suite, 
3 for suite II, 5 for suite III, and 7 for the growth suite. 

4.1.2 Links 

The four communication links provide the sole means of interprocessor communi- 
cation in transputer networks. Within a network, processors run asynchronously until 
a communication needs to take place between two of the nodes. At this point the two 
processors are synchronised by the links' hand-shaking hardware, the communication 
takes place, and the processors proceed to run asynchronously. 

When a communication takes place, data is transferred directly from the memory 
space of one processor to the memory of the other via the link’s DMA controller. The 
T212, T414, and T800 links can all operate at 0.625, 1.25, or 2.5 Mbytes/sec and the 
T9000 links will be able to operate at 12.5 Mbytes/sec. Even though the T9000 has 
much better maximum performance than previous models, all the transputer models 
are link compatable so that mixed networks can be built. 

A high degree of concurrency is built into the transputers link architecture. Each 
link is bi-directional, or full duplex, meaning that communication can take place in 
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both directions simultaneous!}'. In addition, all four links can be active simultane- 
ously, while at the same time the processor can be executing instructions. If the chip 
has a floating point unit, as does the T800, then that too can operate concurrently. 

4.2 Occam Model of Concurrent Processing 

OCCAM takes its name from the 14^ century English philosopher William of 
Occam and his famed “Occam’s razor”: Entia non sunt multiplicanda praeter neces- 
sitatem. Loosely translated this means don’t add complication beyond necessity and 
is the underlying philosophy of OCCAM, to keep it simple. Concurrency and com- 
munication are built into the language according to the Communicating Sequential 
Processes (CSP) model of concurrency proposed by C. A. R. Hoare of Oxford Univer- 
sity [31]. The CSP model of concurrency is based on the concept of building parallel 
applications from networks of sequential processors communicating through synchro- 
nized point-to-point channels. Key elements of CSP that have been subsequently 
adopted by OCCAM are include the following: 

1. a guarded command to allow for non-determinism, 

2. a parallel command to specify concurrent execution, 

3. input and output primitives for communication between parallel processes, 

4. point-to-point one way channels, 

5. strong typing of variables and channels to assure secure communications. 

In OCCAM the basic programming entities are processes which begin, perform 
some action, and eventually terminate [30]. Processes executing in parallel and com- 
municating through channels form the basis of the OCCAM model of concurrency. 
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4.2.1 Primitive Processes and Channels 

All processes are ultimately made up of three primative processes: assignment, 
input, and output. The purpose of an assignment command is to change the value of 
a variable. This can occur by changing it to a specified constant or to the value of an 
expression. An example of assignment is 

x := y + 7 

where the value of x is changed to the result of the expression y + 7. 

Input and output are built into OCCAM at the most basic level. They send and 
receive messages on channels that connect parallel processes. Channels are the only 
way that two processes running in parallel can share information because sharing 
of global variables would lead to unpredictable behavior. Since parallel processes 
execute asynchronously, a process needing to access such a variable would have no 
way of knowing when that variable was last updated and whether or not it contained 
the desired information. To simplify the situation, OCCAM allows only point-to-point 
one way communication between processes. Before they can be used, channels must 
be declared as a certain type, i.e. INT for integer, and given a name. This may seem 
restrictive, but protocol definitions can be substituted for the type allowing a virtually 
unlimited number of combinations, and the benefit is highly secure communications. 
If channel 1 and channel2 connect two processes in parallel, then 

channel 1 ? x 

would read the value from channel 1 into the variable x in the current process, and 

channel 2 ! y + 5 

would send the value of y + 5 down channel2 to the other process where a correspond- 
ing input command would read it into a variable local to that process. The symbols 
? and ! are the commands for inputting and outputting on a channel respectively. 
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4.2.2 Constructions 


Compound processes are built by combining primitive processes within construc- 
tions. Various constructions supported by OCCAM are illustrated by example below. 
In a language designed for concurrency, sequential operation can not be assumed as it 
is in sequential languages. In OCCAM sequential operation must be explicitly stated 
with the SEQ command. 

SEQ 

prod 

proc2 

proc3 

The processes, prod, proc2, and proc3, are then executed one after the other in the 
order listed. Termination of SEQ process occurs when the last process, in this case 
proc3, terminates. The scope of a construction is marked by an indentation of two 
spaces. In OCCAM, indentation is used for grouping commands in the same way that 
BEGIN. . .END and {. . . } are used in PASCAL and C respectively. The command 
to stipulate parallel execution is PAR. 

PAR 

prod 

proc2 

proc3 

This time prod, proc2, and proc3 will be executed concurrently with no regard for 
the order in which they are listed. Termination of a PAR process occurs when the 
slowest process within its scope terminates. 

For cases where a process may need to input from one of several channels and 
perform a task based on which of the channels is communicating, OCCAM provides 
a command for alternation. 


40 


llMi«illil«|M» ■ 1 1 Mil IN* ini I! IlMNII IIM IN II ■ ilMUMMIMI I II III llllllMI Ill I l! INI 1 1 I «| III HI III NH I I III II I III I I II lUI 


ALT 

chanl ? x 
procl 
chan2 ? x 
proc2 
chan3 ? x 
proc3 


In the example, one of the three processes is executed depending on which of the 
channels first becomes ready to communicate. The input statement in an alternation, 
called a ‘guard’, can consist of an input process and Boolean expressions to further 
define which of two simultaneously ready channels will be accepted. 

SEQ, PAR, and ALT statements can be “replicated” to form arrays of processes. 
The simplest example is a replicated SEQ which is roughly equivalent to a FOR 
statement in most conventional languages. 

SEQ i = 0 FOR n 
procl 
proc2 
proc3 


The remaining flow control constructs are similar to those found in conventional 
languages. IF and CASE statements provide conditional branching and a WHILE 
statement provides conditional looping. For examples of OCCAM code, see Appendix 
B. 


4.3 Real Time Issues 

From their beginnings, transputers were targeted for embedded systems applica- 
tions. As a result, they have many features that make them attractive as real time 
processors. Most of the essential functions for simulated and true concurrency, such 
as multitasking and message passing are strongly supported in hardware for max- 
imum efficiency. The process scheduler also supports two priority levels (high and 
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low) for processes. The level of the priority can be specified in the OCCAM code. 
A high priority process will run to the exclusion of all other processes until it can 
no longer proceed or terminates. Low priority processes can be interrupted by high 
priority processes and are serviced in a round robin fashion. A system of priorities 
allows short but time-critical processes to be executed without interruption which is 
often a requirement in real time systems. 

A hardware timer gives programs access to a 1 microsecond clock for high priority 
processes and a 64 microsecond clock for low priority processes. Considering the time 
dependent nature of real time programming, hardware timing support is a significant 
asset. It facilitates the scheduling of time critical events and clocking of benchmarks. 
In OCCAM, timers are accessed through a special variable type called a timer. 

In a real time system, there is often a need to communicate with external devices. 
Interfacing of peripherals can be handled by one of four methods. 

1. The external memory bus. 

2. Dual-ported memory. 

3. An interrupt. 

4. Links. 

The first two options are not in keeping with the transputers design philosophy, 
but they are possible for special purposes. One interrupt pin, or event pin, is pro- 
vided on T800 and earlier transputers that can be used to trigger interrupt service 
routines. Turnaround times for servicing interrupts are under 1 microsecond. The 
favored method of handling any form of communications on the transputer, however, 
is through the links which can also be interfaced with parallel ports by the use of 
external link adapters. 

One of the stated advantages of OCCAM is that it allows code to be developed 
independent of the hardware on which it will eventually run. At some point, however, 


42 


nun mu i Hi'iiM'HT iii mu in ihhh i Lin. ii 4ui liiiiikiiiiiiiiiiiiiiiiiiiiiiHi i 



processes must be loaded and run on physical transputer nodes, and channels must be 
placed on their corresponding links. Resource allocation in OCCAM is handled by the 
commands PLACE and PLACED which are used to assign channels and processes 
to their respective physical devices. In addition, OCCAM allows insertion of low 
level T-code into OCCAM programs. Fortunately, the close relationship between 
OCCAM and the transputer hardware usually eliminates the need to resort to low 
level programming. 
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CHAPTER 5 

CONCURRENT PROCESSING IMPLEMENTATION 


5.1 Algorithms 

Some of the algorithms listed in the suites from section 2.2 have been translated 
into OCCAM for implementation on the transputers. In certain cases special tech- 
niques have been used to speed up the algorithms taking advantage of the nature 
of the problem, the transputer, or OCCAM. In the following sections, these algo- 
rithms will be described along with any details of their implementation. Appendix B 
provides OCCAM listings of all the algorithms described below. 


5.1.1 Pulse-Pair Estimator 

The pulse-pair algorithm has been introduced previously as an estimator of the 
windspeed mean and variance. The technique is based on estimating the complex 
autocovariance of the I&Q return from the pulse Doppler radar. If successive pairs 
of pulses are statistically independent, then it has been shown [32] that a maximum 
likelihood estimator of the complex autocovariance, Rzz{T a ), is given by 


Rzz(T.) = A £ Z-(iT,)Z([i + 1]T,), 

M »'= 0 

where M is the number of pulses, T a is the sampling time between pulses (IPP), and 
Z(iT,) is the complex I&Q sample at time iT,. An estimate of the mean windspeed 
is then given by 

v PP = j^rarg[Rzz(T,)}, 

and the width estimate by 


2 

’ PP “ (2ttT s ) 2 


1 


I Rzz(T.)\ 

Rzz( 0 ) . ’ 


These equations describe a method for obtaining mean and width estimates of the 

windspeed given a complex series of pulse returns in the time domain. An equivalent 

estimate can be obtained from the spectrum by noting that 

M—l 7irkT 

Rzz(T , ) = £ Sz(k)e^f, 

A *=0 

where Sz(k ) denotes the power spectral density. Then taking the argument of 
Rzz(T,) gives 

arg Rzz(T.) - arctan j j-Mj. Sz(i)cos( ^) ) ’ 

and estimates of v pp and w pp can be found as before [33]. 

Both implementations of the pulse-pair estimator have the advantage of being 
more computationally efficient than methods based on the FFT. It has also been 
shown that the pulse-pair estimate has a smaller variance than the FFT estimator 
for low signal to noise levels [19, 17]. The OCCAM implementation of the pulse-pair 
estimator is straightforward. 

5.1.2 Fast Fourier Transform 

A class of algorithms known as FFTs provide a way of calculating the discrete 
Fourier transform, DFT, that is computationally efficient and works well for a wide 
range of problems. The algorithm used here is a complex radix-2 decimation in 
frequency implementation of the FFT translated and modified from a FORTRAN 
version taken from IEEE Programs for Digital Signal Processing [34]. 

In order to speed up the calculation, various modifications have been made to the 
basic FFT algorithm. First it was noticed that a large percentage of the computation 
time (roughly 38%) was spent on calculating the sine and cosine values for the twiddle 
factors. For fixed data record lengths, a fixed set of these twiddle factors are needed, 
so they are computed beforehand and stored in memory for later use by the FFT. 
Techniques for performing the bit reversal are another area of interest for speeding 
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up the FFT [35, 36]. The implementation of the FFT in OCCAM benefited from an 
efficient bit reversal function built into the compiler libraries, and no other steps were 
taken to improve the bit reversal algorithm. 

A method for improving the efficiency of the FFT algorithm when there are a 
large number of zero- valued samples in relation to actual data, is FFT pruning [37]. 
This technique elimates unnecessary computations that involve zeros by “pruning” 
off those branches of the computation. The necessary modifications to the algorithm 
are trivial. It has been shown that the time saved by pruning, tr, is approximately 

, [L + 2(1 — 2-<"-*)] 

tr = 77 » 

M 

where 2 L data points are padded with 2 M ~ L zei'os to form 2 M -point FFT [37]. Pruning 
was left in place even in cases where no zero padding was used since the computational 
overhead for pruning was found to be extremely small. 

The FFT has two primary disadvantages inherent in its approach. The first 
disadvantage is that frequency resolution is limited by the inverse of the length of 
the data record. This limits the ability of the FFT to resolve two or more signals 
closely spaced in frequency. The second disadvantage involves the use of finite length 
sequences to represent signals of infinite extent. By assuming the sequence to be zero 
valued outside a finite number of samples, an implied windowing is imposed on the 
data. It is equivalent to multiplying the data by a rectangular window with unity 
amplitude. In the spectral domain, that is corresponds to convolving the spectrum 
with a sine function. This effect is known as spectral leakage because energy at 
frequencies not represented in the basis set “leak” into frequencies over the full range 
of the basis [38, 37]. These limitations of the FFT are especially problematic for short 
sequences of data which may be the case in a pulse Doppler radar application. 

5.1.3 Autoregressive Modeling 

An alternative to the FFT for estimating the spectrum is autoregressive (AR) 
spectral estimation. This technique has been applied previously to radar applications 
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[39, 40, 41, 42, 43, 44, 45]. The general approach is to fit an AR model to the 
sampled time series, and from the coefficients of the model generate the spectrum. 
The ‘goodness’ of the spectral estimate often depends on how appropriate an AR 
model is for the underlying process [46, 47]. 

An AR model is an all pole model of the form 


H{f) = 


1 

1 + EL 1 a k exTp(—j2nfkT) ’ 


where are the model coefficients, p is the model order, and T is the sampling 
interval. Once the AR coefficients are determined, the power spectral density can be 
found as 

P m _ T A 

mJ> ~ II + S-, o»exp (-jlrfkT)P' 

where a 2 is the white noise power. Thus, the parameters that need to be estimated 
are the a*’ s and a 2 . Many methods for computing these parameters have been devel- 
oped, some that rely on estimates of the autocorrelation function and others that are 
based on a least squares linear prediction approach [48]. The algorithm chosen for 
this application is of the latter class. It is a block data, as opposed to recursive, mod- 
ified covariance algorithm that has been translated into OCCAM from a FORTRAN 
program supplied with Marple’s book, Digital Spectral Analysis with Applications [48]. 

Advantages of using the modified covariance method of spectral estimation are 
an improved frequency resolution compared to the FFT, for low signal to noise ra- 
tios, and an ability to estimate the spectrum at any frequency within the processing 
bandwidth instead of at frequencies predetermined by the length of the data record. 
Also the problem of spectral leakage is eliminated because the model does not force 
the sequence to be zero valued outside the range of the data. Disadvantages are that 
it is more computationally intensive than the FFT, requiring Np + 6p 2 operations as 
opposed to N\og 2 (N) for the FFT [48]. Also the AR model may not be well suited 
to the problem, often resulting in high order values of p to adequately represent it. 
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Choosing an appropriate model order is not always a straight forward process. Sev- 
eral algorithms have been proposed for this purpose [48, 49, 50, 51], but it has been 
shown that none of these methods works well for short segments of actual data [52]. 

5.2 Hardware 

The transputer network used to test the above algorithms and obtain bench- 
marks for processing radar return concurrently is shown in Figure 5.1 as a block 
diagram. It consists of 9 transputers, eight of which are T414s configured in a 3-D 
hypercube and the nineth, a T800 root node, serving as a gateway to the PC. The 
hypercube topology was chosen because it is flexible, in the sense that many other 
topologies can be mapped on the hypercube, and because it is a popular network that 
has received considerable attention in parallel processing literature [20]. The most 
important attribute of a network for this application is that a minimal number of 
interprocessor communications be necessary to distribute the range cell data. The 
hypercube satisfies this requirement. 

On the PC, Radar I&Q data is stored in a file which is read into the memory of 
the root node by an interfacing program written in the C language. At this time the 
user is asked to specify eight range cells to be displayed to the screen. All the data is 
then distributed by range cell throughout the network by routing programs resident 
in each node. A spectral estimate is performed in each node, and then the results are 
recollected in the root node where Doppler spectrums for the eight specified range 
cells are read directly into the PC memory by the C interfacing program. Finally, a 
C program resident on the PC host is invoked to display the results on the screen. 
Figure 5.2 shows a display of some simulated data. 

5.3 Benchmarks 

Measurement of benchmarks was complicated by the use of two different trans- 
puter processor models operating at different clock speeds. The transputer network 
was made up of one T800 running at 17.5MHz and eight T414s running at 20MHz. 
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Figure 5.1 Block diagram of the transputer evaluation set-up. 
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Total points 
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74.43 

43.42 

64 
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512 

68.82 

40.15 

32 

480 

512 

63.02 

36.76 


Figure 5.3 FFT benchmarks with pruning. 


The primarj' difference between the T800 and the T414 is the TSOO’s FPU which 
gives it much better floating point performance than the T414. For this reason, 
benchmarks for individual algorithms were performed on the TSOO and the execution 
times were scaled to show the expected performance on the fastest available trans- 
puter, a TSOO running at 30MHz. Measuring speedup, however, required the use 
of multiple transputers so those measurements were performed on networks of the 
T414s. 

5.3.1 Fast Fourier Transform 

Two sets of benchmarks were compiled for the FFT. The first, shown in Figure 
5.3, illustrates the effect of pruning. The number of actual data points was varied 
from 512 down to 32 with zeros added as required to make a total of 512 points. 
Notice that a large ratio of zeros to data points is necessary before significant time 
savings are seen. For a 512 point FFT on a 30MHz TSOO, Figure 5.3 shows that 
pruning 480 zeros only saved 9.13 msec, or 20% of the total execution time. In fact 
for a one to one ratio of zeros to data points pruning gives no improvement. It is 
evident that even with pruning, zero padding incurs a significant computational cost. 

Figure 5.4 shows the FFT benchmarks with no zero padding. Recall that the 
window of time available for processing all the range cells was found in section 2.1 
to be 40 ms. Comparing this value with the execution times listed in the Figure 5.4, 
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Pulses 

TSOO @ 17.5MHz 
Time(msec) 

TSOO @ 30MHz 
Time(msec) 

512 

78.67 

45.89 

256 

36.79 

21.46 

i 128 

16.71 

9.75 

64 

7.74 

4.52 

32 

3.56 

2.08 


Figure 5.4 FFT benchmarks with no zero padding. * 

"i i 

j 1 

notice that even on the 17.5MHz T800 for 128 or fewer samples the computation can \ 

take place within the available processing window. On a 30MHz T800 the number of | 

samples can be extended to 256. The assumptions here, of course, are that a single I 

processor is allocated to each range cell and that the overheads for using such a large ! 

1 1 

1 . ! 
number of processors are ignored. 

5.3.2 Autoregressive Model 

i 

The AR spectral estimate was run for model orders 3, 5, and 10. Order 3 
was chosen as a baseline because it has enough poles to account for a clutter mode 
and weather mode plus an extra pole to provide some additional detail. It would 
not be unreasonable to use a 2 nd order model just to capture those two dominant 

i 

modes, and it should be noted a 1 J| order AR model is equivalent to the pulse pair 

i 

estimator [53]. Model order 5 was chosen as an intermediate level because it seems 
to adequately represent the Doppler power spectrum of simulated microburst data 
based on qualitative comparison with the FFT, and order 10 was chosen as a growth 
case based on similar qualitative grounds and previous analysis of simulated clutter 
[15]. Figure 5.5 shows the benchmarks for all three model orders for range cell records 
of 32 to 512 pulses. On a 30MHz T-S00 transputer, notice that the 3 rd and 5 th order 
models can accommodate up to 256 pulses within the allowable processing window 
and the 10 t/l order model can accommodate up to 128 pulses. 
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Model order 

Pulses 

T800 @ 17.5MHz 
Time(msec) 

T800 @ 30MHz 
Time(msec) 

3 1 

512 

84.14 j 

49.08 

3 

256 

42.94 

25.05 

3 

128 

22.01 

12.84 

3 

64 

11.86 

6.92 

3 1 

32 

3.56 

2. OS i 

5 

512 

110.83 

64.65 1 

5 

256 

57.65 1 

33.63 

| 5 

128 

30.56 

17.83 

5 

64 

17.43 

10.17 

5 

32 

10.81 

6.31 

10 

512 

173.16 

101.01 

10 

256 

95.55 

55.74 

10 

128 

54.79 

31.96 

10 

64 

35.13 

20.49 ^ 

10 

32 

25.18 

14.69 


Figure 5.5 Autoregressive modeling benchmarks. 
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Processors (N) 

S(N) 

E(N) 

1 

1.000 

1.000 

2 

1.988 

0.994 

4 

3.877 

0.969 

8 

7.552 

0.944 


Figure 5.6 Speedup and efficiency measurements. 


5.3.3 Speedup 

The speedup was measured on networks of 1, 2 ,4, and 8 processors connected 
in hypercubes of dimension 0, 1,2, and 3 respectively. Figure 5.6 shows the speedup 
and efficiency measured for all four networks. Notice that the speedup is nearly linear 
for up to 8 processors with an efficiency at 8 processors of 0.944. If the speedup were 
perfectly linear the efficiency would be a constant of 1. Figure 5.7 plots the speedup 
measurements from Figure 5.6 versus the number of processesors. In the plot, the 
solid line corresponds to linear speedup and the dashed line is the actual measured 
speedup. This figure shows that for a low number of processors, the speedup is nearly 
linear. Recall from section 3.2 that the bounds on speedup for this application were 

found to be 

N A 

wta^ r - - min ( 7V ’ j4 )- 

For a network of N = 8 processors and A = Nr = 40 range cells, this equation gives 
a speedup bound of 6.81 < S(N ) < 8.00 which agrees with the measured value. 
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Speedup S(N) 



Figure 5.7 Plot of speedup. 



CHAPTER 6 


CONCLUSIONS 

Regardless of the type of sensor eventually used to provide airborne microburst 
detection, the signals generated will have to be processed in real time. In order to 
assure reliable detection under all atmospheric conditions, it is likely that a combi- 
nation of sensors and sophisticated signal processing techniques will be required. A 
promising approach to achieving the high data rates necessary for real time processing 
is to perform the signal processing concurrently on an array of multiprocessors. This 
thesis has examined issues involved in implementing a concurrent real time microburst 
detection system using a PC based parallel computer called the transputer. 

The specific strategy investigated here is to distribute the processing of range 
cells over a network of single instruction stream computers that pass messages on 
high speed serial communication links. The processing within a range cell is entirely 
sequential so that communication overheads are minimized. Also, existing algorithms 
can be used with only slight modification. With regard to network topology, a hyper- 
cube arrangement is adopted because it allows the minimum number of link commu- 
nications to distribute the range cell data throughout the network. The hypercube 
also has the benefit of having been well studied and documented in previous works 
[ 20 ]. 

A number of algorithms are implemented to demonstrate that real time process- 
ing is feasible. This effort focuses on the most computationally intensive step in the 
signal processing, estimating the spectrum. Two methods are considered, the FFT 
and AR modeling. The FFT is a well known method for estimating the spectrum that 
is computationally efficient and robust. To improve the algorithms suitability for real 
time processing, look-up tables are used to calculate the twiddle factors and pruning 



is used to eliminate needless computations due to zero padding of the data. Draw 
backs of the FFT are the limited frequency resolution and spectral leakage. Both of 
these limitations are most severe for short sequences of data as may be the case for 
range cell records based on the stationarity of the weather spectrum. Spectral esti- 
mates based on AR modeling do not suffer from spectral leakage and the frequency 
resolution is not as limited; however, AR modeling is much more computationally 
intensive than the FFT. It is also not clear that the most appropriate model for the 
weather plus clutter time series is autoregressive, or what the optimum model order 
is. Analysis of simulated data has indicated that a 5 th order AR model might ade- 
quately represent the spectrum and it may be possible to go as low as 2 order to 
capture just the two modes associated with the clutter and the weather [54]. 

Benchmarks of the FFT and AR modeling algorithms show that the returns from 
a single range cell could be processed in real time on a single transputer. Unfortu- 
nately that means that a network of transputers with a number of nodes equal to 
the number of range cells would be required to process the complete return signal. 
From measurements of the speedup, it is apparent that overheads associated with con- 
currency and especially communication overhead will begin to significantly degrade 
performance for large numbers of processors so that even allowing one processor for 
each range cell would probobly not be sufficient for real time. This analysis is based 
on technology that is currently available, but an inherent property of concurrent pro- 
cessing is that a small improvement in the performance of the individual nodes can 
lead to vast improvements in the overall system performance. In 1992 a new gen- 
eration of transputer called the T9000 will be available that promises an order of 
magnitude increase in performance over previous models. Using T9000 transputers, 
a system such as the hypercube arrangement described in section 5.2 would meet real 
time requirements. 

The transputer evaluation system provides a parallel processing platform for 
conducting a number of research efforts including development and benchmarking 
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algorithms and measurements of concurrent performance in anticipation of advances 
in computer hardware. In addition, it can serve as a quick-look processor for identi- 
fying interesting sections of data during post-processing of flight test data. 

Future work in this area may include investigating the use of medium grain SIMD 
computers to perform the necessary radar signal processing since a single instruction 
stream could be used to perform identical computations on each range cell. Another 
approach might be to distribute the range cell computation over multiple processors 
for implementation on massively parallel networks. It may also be possible to integrate 
some of the radar control functions such as automatic gain control (AGC) on the same 
parallel processor that performs the other signal processing tasks. Finally, application 
of sensor fusion and artificial intelligence principles to microburst detection may be 
called for if a combination of sensor technologies is necessary to provide reliable 
detection. 
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APPENDICES 


Appendix A 
Radar Parameters 


Parameter 

Value 

Frequency 

9.33625 GHz 
9.33772 GHz 

Pulse width 

0.96 to 8.16 psec 

Power 

200 or 2000 Watts 

Pulse Repetition Frequency 

1198, 2395, 3755, 4791, 
or 9581 pulses/sec 

Antenna Gain 

34 dBi 

Two-way Antenna Beamwidth 

2.5° 

Antenna Scan Width 

0, ±5°, ±10°, ±15°, ±20°, 
±30°, ±45°, or ±60° 

Antenna Scan Center 

±90°, adjustable in .25° increments 

Antenna Tilt Range 

-32° to +40° 

Azimuth Scan Rate: 
All PRFs except 3755 
At PRF of 3755 

37.5, 18.75, 9.375, or 4.6875 °/sec 
29.25, 14.625 °/sec 

Antenna Polarization 

Horizontal or Vertical 

Data Encoding 

12-bit I&Q samples 

Number of Range Cells 

1 to 124 


Figure A.l Radar Parameters. 













Appendix B 

Occam Software Listing 
B.l Pulse-Pair 


— {{{ pulse pair 
•INCLUDE "ws.inc" 

PROC PP(CHAN OF C from. pc, CHAN OF PPP to. pc) 

•USE "cmplxlib" 

•USE "snglmath.lib" 

VAL N IS pulses: 

VAL prf IS 3720.0CREAL32): 

VAL prp IS 1 . 0 (REAL32) /prf : 

VAL pi IS 3.141592654CREAL32) : 

VAL tick. rate IS 1000000. 0(REAL32) : 

VAL amult IS 1 .0(REAL32)/((((2.0(REAL32)*pi)*pi)*prp)*prp) : 

REAL32 argrt s , amagnrt s, sum, real. sum, aimag . sum , bmult , tr , t i , r 0 , t ime 
REAL32 eofmean.eofstd : 

[N+1]REAL32 areal, aimag : 

INT I,tl,t2: 

TIMER clock : 

SEQ 

real . sum : =0 . 0 (REAL32) 
aimag . sum : =0 . 0 (REAL32) 

I : = 1 

from. pc ? [areal FROM 1 FOR N] ; [aimag FROM 1 FOR N] 
clock ? tl 
WHILE I<N 
SEQ 

CMul(areal[I] ,-aimag[I] , areal [1+1] ,aimag[I+l] ,tr,ti) 
real . sum : =real . sum+tr 
aimag . sum : =aimag . sum+t i 
I :=I+1 

amagnrts : =SQRT ( (real . sum*real . sum) + (aimag . sum* aimag. sum) ) / 
(REAL32 ROUND(N-l)) 
argrts:= AT AN2 (real. sum, aimag. sum) 
sum:=0 .0(REAL32) 

SEQ 1=1 FOR N 

sum:=sum + ((areal[I]*areal[I]) + (aimag[lD*aimag[I])) 



rO : =sum/ (REAL32 ROUND (N)) 

eofmean : = (30 . 02 (REAL32) *argrts) /pi 

bmult : =ABS ( 1 . 0 (REAL32) - (ABS (amagnrts)/rO) ) 

eof std : = ( (30 . 02 (REAL32) *SqRT(amult*bmult ) ) *2 . 0 (REAL32) ) /prf 

clock ? t2 

time := (REAL32 ROUND (t2 MINUS tl))/tick.rate 
to. pc ! time; eofmean; eofstd 


> 


B.2 Fast Fourier Transform | 

i 

#INCLUDE "hostio. inc" j 

# INCLUDE "ws.inc" j 

— -C-C-C FUN PWR2 j 

INT FUNCTION PWR2 (VAL INT i) j 

INT b : | 

VALOF I 

SEQ } 

b : =1 

b : =ASHIFTLEFT (b , i ) 

RESULT b I 


PROC INDEX. TRIG (VAL INT J.I.REAL32 C.S.VAL [(pulses/2) +1]REAL32 
cos. table, sin. table) 

INT index: 

SEQ 

index : « ( J- 1 ) *PWR2 (I- 1 ) 

C : =cos .table [index] 

S : =sin .table [index] 


— -C-C-C PROC FFT 

PROC FFT (CHAN OF C from. pc, CHAN OF AR to. pc) 


#USE "snglmath.lib" 

VAL twopi IS 6.283185307179586(REAL32) : 
VAL tick. rate IS 1000000. 0(REAL32) : 

VAL INT N IS pulses: 

VAL REAL32 NN IS (REAL32 ROUND N) : 

REAL32 time: 

[(N/2)+l]REAL32 cos .table , sin. table : 
[N+l]REAL32 X,Y : 
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— {{{ Cooley-Tukey Radix-2 
TIMER clock: 

INT I,N2,tl,t2: 

SEq 

SEq i=0 FOR ( (N/2)+l) 

REAL32 r: 

SEq 

r:=(((REAL32 ROUND i)*twopi)/NN) 
cos.table[i] :=C0S(r) 
sin.table[i] :=SIN(r) 

WHILE TRUE 
SEq 
N2:=N 
I : =1 

from. pc ? [X FROM 1 FOR N] ; [Y FROM 1 FOR N] 
clock ? tl 
WHILE I<=M 
INT J.Nl.LL: 


SEq 


N1 

= N2 

N2 

= N2/2 

LL 

= N2 

IF 


I 

< (M-LO) 


LL := L2 
I >* (M-LO) 
SKIP 


J : =1 

WHILE J<=N2 
REAL32 C,S : 

INT K: 

SEq 

INDEX . TRIG ( J , I , C ,S , cos .table , sin . table) 
K:=J 

WHILE K <= N 
REAL32 XT , YT : 

INT L: 

SEq 

L:=K+N2 

XT : =X [K] -X[L] 

YT : =Y [K] -Y[L] 

X[K] :=X[K]+X[L] 

Y[K] :=Y [K]+Y[L] 

X[L] :=(C*XT)+(S*YT) 

Y[L] : = (C*YT)-(S*XT) 

K:=K+N1 
J :=J+1 
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I :=I+1 


VAL REAL32 c IS 60.0(REAL32)/NN : 

[N]REAL32 ws ,z , xtemp .ytemp : 

INT p : 

SEq 

[xtemp FROM 0 FOR N] := [X FROM 1 FOR N] 

[ytemp FROM 0 FOR N] := [Y FROM 1 FOR N] 

SEQ 1=0 FOR N 
SEQ 

p : =BITREVNBITS (I , M) 

z[I] : =10. 0(REAL32)*AL0G10( (xtemp [p] *xtemp [p] )+ 
(ytemp [p] *ytemp [p] ) ) 

ws[I] :=((REAL32 ROUND I) - (REAL32 ROUND (N/2)))*c 
[xtemp FROM 0 FOR N/2] := [z FROM M/2 FOR N/2] 
[xtemp FROM N/2 FOR N/2] := [z FROM 0 FOR N/2] 
clock ? t2 

time := (REAL32 ROUND (t2 MINUS tl))/tick.rate 
to. pc ! time; ws; xtemp 


B.3 Autoregressive Model : 

# INCLUDE "ws.inc" I 

— -C-C-C PROC FFT * 

PROC FFT(VAL [(pulses/2)+l]REAL32 cos. table, sin. table, 

[pulses+l] REAL32 S.f , [ORD+2] REAL32 a.r.a. i ,REAL32 sigma) i 

#USE "snglmath.lib" ■ 

VAL INT N IS pulses : I 

[N+1]REAL32 R.Q: I 

INT I,N2: [ 

— -C-C-C Cooley-Tukey Radix-2 | 

SEQ 

R[lJ:=1.0(REAL32) 

Q [l] :=0.0(REAL32) 

SEQ G=2 FOR ORD 
SEq 

R[G] :=a.r[G-l] 

Q[G] :=a.i[G-l] 

SEQ G= (ORD+2) FOR (N-(0RD+1)) 

SEQ 

R[G] :=0.0(REAL32) 

Q[G] :=0.0(REAL32) 
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N2:=N 
I : = 1 

WHILE I<=M 
INT J,N1,LL: 
SEQ 


N1 

= N2 

N2 

= N2/2 

LL 

= N2 

IF 


I 

< (M-LO) 


LL := L2 

I 

>= (M-LO) 


SKIP 

J : =1 


WHILE J<=LL 
REAL32 C,S : 

INT K: 

SEq 

INDEX . TRIG ( J , I , C , S , cos . table , sin .table) 
K:«J 

WHILE K <= N 
REAL32 XT.YT: 

X IS [R FROM 0 FOR (N+l)] : 

Y IS [Q FROM 0 FOR (N+i)] : 

INT L: 

SEq 

L:=K+N2 

XT : =X [K] -X [L] 

YT:=Y[K]-Y[L] 

X [K] :=X[K]+X[L] 

Y[K] :=Y[K]+Y[L] 

X[L] :=(C*XT)+(S*YT) 

Y[L] :=(C*YT)-(S*XT) 

K:=K+N1 

J:=J+1 

I:»I+1 

— AR Spectral extimate 

VAL REAL32 NN IS (REAL32 ROUND N) : 

[N]REAL32 temp : 

INT p : 

SEq 

SEq 1=1 FOR N 
SEq 

S [I- 1] : =sigma/ ( (R [I] *R [I] ) + ( Q [I] *q [I] ) ) 
SEq 1=0 FOR N 
SEq 

p : =BITREVNBITS ( I , M) 
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temp [I] : = 10 . 0 (REAL32) +AL0G10 (S [p] ) 

f [I] :=(((REAL32 ROUND I)-(NN/2.0(REAL32)))*60.0(REAL32))/NN 
[S FROM 0 FOR N/2] := [temp FROM N/2 FOR N/2] 

[S FROM N/2 FOR N/2] := [temp FROM 0 FOR N/2] 


PROC MODCOVARCCHAN OF C from. pc, CHAN OF AR to. pc) 

#USE "snglmath.lib" 

#USE "cmplxlib" 

VAL N IS pulses : 

VAL twopi IS 6 . 283185307 179586 (REAL32) : 

VAL tick. rate IS 1000000. 0CREAL32) : j 

VAL REAL32 NN IS (REAL32 ROUND N) : j 

i 

INT M,tl,t2 : ! 

REAL32 rl , r2,r3,r4,P, delta, gamma, lambda. r, lambda. i , tmp. r , tmp. i : I 

[N+1]REAL32 x.r.x.i : 

[ORD+2] REAL32 a.r,a.i,c.r,c.i,d.r,d.i : 

[(N/2)+l]REAL32 cos. table, sin. table : 

TIMER clock : 

SEQ 

rl := 0.0CREAL32) 

M := 0 

SEQ 1=0 FOR ( (N/2)+l) 

REAL32 r : 

SEq 

r:=(((REAL32 ROUND I)*tvopi)/NN) 
cos. table [I] := COS(r) 
sin.table[I] := SIN(r) 

SEQ 1=1 FOR (ORD+1) 

SEQ 

a.r[l] := 0.0(REAL32) 
a.i[I] := 0.0CREAL32) 

from. pc ? [x.r FROM 1 FOR N];[x.i FROM 1 FOR N] 
clock ? tl 
SEQ 1=2 FOR (N-2) 

rl := rl+(2.0(REAL32)*CMag2(x.r[I] ,x.i[I])) 
r2 := CMag2(x.r[l] ,x.i[l]) 
r3 := CMag2(x.r[N] ,x.i[N]) 

r4 := 1 . 0(REAL32)/(rl+(2.0(REAL32)*(r2+r3))) 

P := rl+(r2+r3) 
delta := 1 .0(REAL32)-(r2*r4) 
gamma := 1.0(REAL32)-(r3*r4) 

CMul(x.r[l] ,x.i[l] ,x.r[N] ,x.i[N] ,tmp.r,tmp.i) 

CSmul (tmp . r , -tmp . i , r4 , lambda . r .lambda . i) 
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CSmul(x.r[N] ,x.i[N] ,r4,c.r[l] ,c.i[l]) 
CSmul(x.r[l] , -x . i [1] ,r4,d. r [1] ,d . i [1] ) 

Main Loop 


WHILE M<0RD 

REAL32 save 1. r, savel.i, theta. r, theta. i, PSI. r, PSI. i, dummy: 

REAL32 XI.r,XI.i,EF.r,EF.i,EB.r,EB.i,r5,tmpl.r,tmpl.i : 

REAL32 Cl.r,Cl.i,C2.r,C2.i,C3.r,C3.i,C4.r.C4.i : 

[ORD+l] REAL32 R.r.R.i : 

INT J : 

SEQ 

M := M+l 

savel.r := 0.0(REAL32) 
savel.i : = 0.0(REAL32) 

SEQ I=(M+1) FOR (N-M) 

SEQ 

CMuKx .r [I] ,x. i [I] ,x .r [I-M] , -x . i [I-M] .trnpl .r .trnpl . i) 
savel.r : = savel .r+tmpl .r 
savel.i : = savel .i+tmpl . i 
savel.r := 2.0(REAL32)*savel .r 
savel.i := 2.0(REAL32)*savel . i 
R.r[M] : = savel.r 
R.i[M] := -savel.i 

CMul(x. r[N] ,x.i[N] ,d.r[l] ,d.i[l] , theta. r, theta. i) 

CMul(x. r[N] ,x.i[N] ,c.r[i] ,c.i[l] ,PSI.r,PSI.i) 

CMul(x.r[l] , -x.i[l] ,d.r[l] ,d.i[l] .XI.r.XI.i) 

IF 

MOl 

SEq 1=1 FOR (M-l) 

REAL32 tmp .r ,tmp . i ,tmpl .r ,tmpl . i : 

SEQ 

CMuKx.r [N-I] ,x . i[N-I] ,d.r[I+l] ,d.i[I+l] .tmp.r.tmp.i) 
theta. r := theta. r+tmp.r 
theta. i := theta. i+tmp . i 

CMuKx. r[N-I] ,x.i[N-I] ,c.r[I+l] ,c.i[I+l] .tmp.r.tmp.i) 
PSI.r := PSI. r+tmp.r 
PSI.i := PSI. i+tmp. i 

CMul(x. r[I+l] ,-x.i[I+l] ,d.r[I+l] ,d.i[I+l] .tmp.r.tmp.i) 
XI. r := XI. r+tmp.r 
XI . i : = XI . i+tmp . i 

CMul(x. r[N+(l-M)] ,x . i [N+(l-M)] ,x.r[N+((l-M)+I)] . 

-x. i[N+((l-M)+I)] .tmp.r.tmp.i) 

CMuKx. r[M-I] ,x.i[M-I] ,x.r[M] .-x.i[M] ,tmpl .r ,tmpl . i) 
R.r [I] := R . r [I] - (tmp . r+tmpl . r) 

R.i[I] := R. i [I] -(tmp. i+tmpl. i) 

CMul(R.r[I] ,-R.iCl] ,a.r[M-I] ,a.i[M-I] .tmp.r.tmp.i) 
savel.r := savel .r+tmp .r 
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savel.i := savel . i+tmp . i 


M = 1 
SKIP 


Order update of coefficient (a) vector 

CSmul (-savel .r ,-savel . i, 1.0 (REAL32)/P f Cl. r,Cl .i) 
a.r[M] := Cl.r 
a.i[M] := Cl.i 

P : = P*(1.0(REAL32)-CMag2(Cl.r,Cl.i)) 

IF 

M <> 1 
INT MI : 

SEQ 1=1 FOR (M/2) 

REAL32 tmp. r, tmp. i: 

SEQ 

MI := M-I 
savel. r := a.r[I] 
savel.i := a.i[I] 

CMul(Cl .r,Cl .i,a.r[MI] ,-a.i[MI] ,tmp.r ,tmp. i) 

CAdd (save 1 . r .save 1 . i , tmp . r , tmp . i , a . r [I] , a . i [I] ) 

IF 

I <> MI 
SEq 

CMul (Cl . r , Cl . i , savel .r, -savel . i , tmp . r , tmp . i) 
a.r[MI] := a . r [MI] +tmp . r 
a.i[MI] := a . i [MI] +tmp . i 
I = MI 
SKIP 

M = 1 
SKIP 
IF 

M = ORD 

REAL32 time : 

[N+1]REAL32 S.f: 

SEq 

P := 0 . 5 (REAL32) * (P/(REAL32 ROUND (N-M))) 

FFT(cos .table, sin. table, S.f .a.r.a.i.P) 
clock ? t2 

time := (REAL32 ROUND (t2 MINUS tl))/tick.rate 
to. pc ! time; [f FROM 0 FOR N] ; [S FROM 0 FOR N] 

— Time update of C,D vectors and gamma, delta, lambda scalars 

M <> ORD 
SEq 

rl := 1 .0(REAL32)/((delta*gamma)- 
CMag2 (lambda . r .lambda . i) ) 
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CMul (theta . r , theta . i , lambda . r , -lambda . i , tmp . r , tmp . i ) 
CAdd(tmp.r,tmp.i, (delta*PSI .r) , (delta*PSI . i) , 
tmpl .r.tmpl . i) 

CSmuKtmpl .r.tmpl . i,rl ,C1 .r,Cl .i) 

CMul(PSI .r.PSI . i, lambda. r, lambda. i, tmp. r, tmp. i) 
CAdd(tmp ,r,tmp . i , (gamma*theta.r) , (gamma*theta. i) , 
tmpl .r, tmpl . i) 

CSmuKtmpl .r.tmpl. i,rl ,C2.r,C2.i) 

CMul (XI .r ,XI . i .lambda. r , -lambda. i, tmp. r, tmp. i) 

CAdd(tmp .r ,tmp . i , (delta*theta.r) , (delta*theta. i) , 
tmpl .r.tmpl . i) 

CSmul (tmpl. r, tmpl. i,rl,C3.r,C3.i) 

CMul (theta . r , theta . i , lambda . r .lambda . i , tmp . r , tmp . i) 

C Add ( tmp .r, tmp . i , (gamma*XI .r) , (gamma*XI . i) , 
tmpl .r.tmpl . i) 

CSmul (tmp 1 . r , tmpl . i , r 1 , C4 . r , C4 . i) 

SEQ 1=1 FOR ( (M-l)/3) 

INT MI : 

REAL32 save2.r,save2.i,save3.r,save3.i : 

REAL32 save4. r,save4.i, tmp. r, tmp. i, tmpl. r, tmpl. i : 
SEQ 

MI := (M+l)-I 
savel.r := c.r[I] 
savel.i := -c.i[I] 
save2.r := d.r[I] 
save2.i := -d.i[I] 
save3.r := c.r[MI] 
save3.i := -c.i[MI] 
save4.r := d.r[MI] 
save4.i := -d.i[MI] 

CMul (Cl .r ,C1 . i,save3.r,save3.i,tmp.r,tmp. i) 

CMul(C2 .r ,C2 . i ,save4.r,save4. i, tmpl .r.tmpl . i) 
c.r[I] := c.r[I]+(tmp.r+tmpl.r) 

c. i[I] := c.i[I]+(tmp.i+tmpl.i) 

CMul (C3 . r , C3 . i , save3 . r ,save3 . i , tmp . r ,tmp . i) 

CMul (C4 . r , C4 . i , save4 . r , save4 . i , tmpl . r.tmpl . i) 

d. r[I] := d.r[I]+(tmp.r+tmpl.r) 
d.i[I] := d.i[I]+(tmp.i+tmpl.i) 

IF 

I <> MI 
SEQ 

CMul (Cl. r, Cl. i, savel.r, savel.i, tmp. r, tmp. i) 
CMul (C2.r,C2.i,save2.r,save2.i, tmpl. r.tmpl .i) 
c.r[MI] := c.r[MI]+(tmp.r+tmpl.r) 

c. i[MI] := c.i[MI]+(tmp.i+tmpl.i) 

CMul (C3 . r , C3 . i , savel . r , savel . i ,tmp . r , tmp . i) 
CMul(C4.r,C4. i,save2.r,save2. i, tmpl. r.tmpl . i) 

d. r[MI] := d.r[MI]+(tmp.r+tmpl.r) 
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d.i[MI] := d.i[MI]+(tmp.i+tmpl.i) 

I = MI 
SKIP 

r2 := CMag2(PSI.r,PSI.i) 
r3 : = CMag2 (theta. r, theta. i) 
r4 := CMag2(XI.r,XI.i) 

CMul (PSI . r , PSI . i , lambda . r , lambda . i , tmp . r , tmp . i ) 
CMul(tmp .r ,tmp . i , theta. r, -theta. i ,tmpl .r ,tmpl . i) 
r5 : = gamma - (rl*((r2*delta)+((r3*gamma)+(2.0(REAL32) 
tmpl.r)))) 

CMul (theta . r , theta . i , lambda. r , lambda . i , tmp . r , tmp . i) 
CMul (tmp . r , tmp . i , XI . r , -XI . i , tmpl . r ,tmpl . i) 
r2 := delta - (rl*((r3*delta)+((r4*gamma)+(2.0(REAL32) 
tmpl.r)))) 
gamma := r5 
delta := r2 

CMul (C3 . r , C3 . i , PSI . r , -PSI . i , tmp . r , tmp . i) 

CMul(C4.r ,C4. i .theta. r, -theta. i, tmpl.r, tmpl . i) 
lambda. r :» lambda. r+ (tmp . r+tmpl .r) 
lambda. i : = lambda. i+ (tmp. i+tmpl.i) 

IF 

P<0.0(REAL32) 

STOP 

P>0.0(REAL32) 

SKIP 

IF 

(delta>0 . 0(REAL32) ) AND (deltaCl .0(REAL32) ) AND 
(gamma>0.0(REAL32)) AND (gamma<l .0(REAL32) ) 

SKIP 

TRUE 

STOP 

— Time update of "a" vector; order updates of c,d 

— vectors and gamma, delta, lambda scalars. 

rl := 1 . 0 (REAL32) /P 
r2 := 1.0(REAL32)/((delta*gamma)- 
CMag2 (lambda . r , lambda . i) ) 

EF.r : = x.r[M+l] 

EF.i := x.iCM+1] 

EB.r := x.r[N-M] 

EB.i := x . i [N-M] 

SEQ 1=1 FOR M 
SEQ 

CMul(a.r[I] ,a.i[I] ,x.r[(M+l)-I] ,x.i[(M+l)-I] , 
tmp. r, tmp. i) 

EF.r := EF.r+tmp.r 
EF.i := EF.i+tmp.i 



CMul(a.r[I] ,-a.i[I] ,x .r[(N-M)+I] ,x . i [(N-M)+I] , 
tmp . r , tmp . i ) 

EB . r : = EB . r+tmp . r 
EB . i : = EB . i+tmp . i 
CSmuKEB .r ,EB . i,rl ,C1 .r,Cl . i) 

CSmul (EF . r , -EF . i , r 1 , C2 . r , C2 . i ) 

CMul (EF.r.EF.i, lambda. r, lambda. i, tmp. r, tmp. i) 
CAdd(tmp .r ,tmp . i , (delta*EB.r) , ((-delta) *EB.i) , 
tmpl .r .tmpl . i) 

CSmul (tmpl .r ,tmpl . i,r2,C3 .r,C3.i) 

CMul (EB . r , EB . i .lambda . r .lambda . i , tmp . r , tmp . i) 
CAdd(tmp . r , -tmp . i , (gamma*EF . r) , (gamma*EF . i) , 
tmpl .r .trnpl . i) 

CSmul (tmp l.r, tmpl. i,r2,C4.r,C4.i) 

J := M 
WHILE J>0 

REAL32 tmp. r, tmp. i : 

SEQ 

savel.r := a.r[J] 
savel.i := a.i[J] 

CMul(C3.r,C3.i,c.r[J] ,c.i[J] , tmp. r, tmp . i) 
CAdd(savel .r.savel . i , tmp. r, tmp. i ,a.r[J] , a. i [J] ) 
CMul(C4.r,C4. i,d.r[J] ,d.i[J] .tmp.r.tmp.i) 
a.r[J] := a.r[J]+tmp.r 
a . i [ J] : = a . i [ J] +tmp . i 

CMul (Cl .r ,C1 . i .savel .r.savel . i.tmp .r.tmp . i) 
c.r[J+l] := c.r[J]+tmp.r 

c. i[J+l] := c.i[J]+tmp.i 

CMul (C2.r,C2.i,savel.r,s ave 1 . i , tmp . r , tmp . i ) 

d. r[J+l] := d.r[J]+tmp.r 
d.i[J+l] := d.i[J]+tmp.i 
J := J-l 

c.rCl] := Cl.r 

c. i[l] := Cl.i 

d. r[l] := C2.r 
d.i[l] := C2.i 

r3 := CMag2(EB.r,EB.i) 
r4 := CMag2 (EF.r.EF.i) 

CMul (EB . r ,EB . i , lambda . r , lambda. i ,tmp .r ,tmp . i) 

CMul (tmp. r, tmp. i, EF.r.EF.i, tmpl. r, tmpl. i) 

P := P - (r2*((r3*delta)+((r4*gamma)+ 

(2 . 0 (REAL32) *tmpl . r ) ) ) ) 
delta := delta-(r4*rl) 
gamma : = gamma- (r3*rl) 

CMul (EF . r ,EF . i ,EB . r , EB . i , tmp . r.tmp . i) 
lambda. r := lambda. r+(rl*tmp.r) 
lambda. i := lambda. i-(rl*tmp . i) 

IF 
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P<0 . 0(REAL32) 

STOP 

P>0 . 0(REAL32) 

SKIP 

IF 

(delta>0 . 0(REAL32) ) AND (delta<l .0(REAL32) ) AND 
(ganunaX) . 0 (REAL32) ) AND (gamma<l ,0(REAL32)) 

SKIP 

TRUE 

STOP 
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